This repository has been archived by the owner on Jan 4, 2023. It is now read-only.

Add per-metric error handling and fix other issues #230

Merged: 12 commits merged into HTTPArchive:master on Aug 30, 2021

Conversation

@GJFR (Member) commented on Aug 27, 2021

Unfortunately, the well-known.js custom metric did not do so well in the previous crawls due to errors. If one parseResponse call fails, the whole metric fails. To solve this, I have separated the error handling for each call.

The metric would also fail when the site's policy wouldn't allow loading another page (e.g. due to CORS). This is now also handled per parseResponse call. Since it is not possible to resolve the underlying cause of these errors, I just indicate that an error occurred while fetching the URL.
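A minimal sketch of what this per-call isolation could look like (parseResponseSafe and the error placeholder are hypothetical names, not the exact code in this PR):

// Hypothetical wrapper: one failing fetch or parse no longer fails the whole metric.
async function parseResponseSafe(url, parser) {
  try {
    const response = await fetchWithTimeout(url);
    return await parser(response);
  } catch (e) {
    // The underlying cause (CORS, network error, timeout) is not observable
    // here, so just record that fetching this URL failed.
    return { error: `Error occurred while fetching ${url}` };
  }
}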

Lastly, I also fixed a capitalisation issue in the string comparisons for /robots.txt.
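Presumably this means normalising case before comparing, since robots.txt directives are case-insensitive (a hypothetical illustration, not the actual diff):

// Normalise case so 'Disallow:' and 'disallow:' compare equal.
const isDisallow = line.trim().toLowerCase().startsWith('disallow:');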

I also see a lot of 'The user aborted a request.' errors in our crawl data. Does this have to do with fetchWithTimeout?

function fetchWithTimeout(url) {
  var controller = new AbortController();
  // Abort the request after 5 seconds; note the timer is never cleared, so it
  // fires even for requests that have already settled.
  setTimeout(() => {controller.abort()}, 5000);
  return fetch(url, {signal: controller.signal});
}
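For context, 'The user aborted a request.' is the DOMException message thrown when an AbortController signal cancels a fetch, so requests exceeding the 5-second budget would surface exactly this way. A variant that clears the timer once the fetch settles (an illustration, not part of this PR) could look like:

// Sketch: same timeout behaviour, but the timer is cleared once the fetch
// settles, so abort() can only fire for requests still in flight.
function fetchWithTimeout(url, ms = 5000) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  return fetch(url, { signal: controller.signal })
    .finally(() => clearTimeout(timer));
}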

WPT tests (focus on the errors in previous crawls)

@GJFR (Member, Author) commented on Aug 27, 2021

Would it be possible to merge this in time for the September crawl?

@rviscomi rviscomi merged commit 1a7ce57 into HTTPArchive:master Aug 30, 2021
@jroakes (Contributor) commented on Oct 13, 2021

Hi @GJFR. In a discussion on Slack, @VictorLeP pointed me to this PR. I have updated a PR for a robots.txt extraction to capture some data for next year's Web Almanac.

I have finalized the PR here and included test cases: #236

The current robots.txt code in the well-known extraction script has two issues:

  1. When several user-agent lines are grouped, it attributes the disallow rules that follow to only the last agent (a grouping sketch follows this list):

# Works for this
user-agent: Googlebot
disallow: /login/

# Does not work for this.  Will attribute the '/login/' match to only Bingbot.
user-agent: Googlebot
user-agent: Bingbot
disallow: /login/

  2. It will match, for example, /book-cataloging/ and genuine-authentic-nike-shoes as secure routes.
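For reference, a minimal grouping-aware sketch (hypothetical, separate from both the well-known metric and the PR code) that attributes each disallow to every user-agent in the preceding group, per the robots.txt grouping rules:

// Hypothetical sketch: consecutive user-agent lines form one group, and
// each disallow that follows is credited to every agent in that group.
function groupDisallows(robotsTxt) {
  const groups = {};        // user-agent -> [disallow values]
  let agents = [];          // agents of the group currently being built
  let lastWasAgent = false;
  for (const raw of robotsTxt.split('\n')) {
    const line = raw.trim().toLowerCase();
    if (line.startsWith('user-agent:')) {
      const agent = line.slice('user-agent:'.length).trim();
      agents = lastWasAgent ? agents.concat(agent) : [agent];
      lastWasAgent = true;
    } else if (line.startsWith('disallow:')) {
      const value = line.slice('disallow:'.length).trim();
      for (const a of agents) (groups[a] = groups[a] || []).push(value);
      lastWasAgent = false;
    }
  }
  return groups;
}

// groupDisallows('user-agent: Googlebot\nuser-agent: Bingbot\ndisallow: /login/')
// -> { googlebot: ['/login/'], bingbot: ['/login/'] }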

I could update my robots.txt script to produce the following output. I have already addressed #1 in my PR. I think I can make #2 better, but probably not perfect.

    "by_useragent": {
      "*": {
        "allow": 1,
        "disallow": 45,
        "crawl_delay": 0,
        "noindex": 0,
        "well_known_disallows: 5,
        "other": 0
      },
      "baiduspider": {
        "allow": 0,
        "disallow": 163,
        "crawl_delay": 0,
        "noindex": 0,
        "well_known_disallows: 2,
        "other": 0
      },
      ...
      }

I wanted to get a POV from the Security Team on whether we should consolidate, or edit the well-known metric to correct these issues.

@GJFR (Member, Author) commented on Oct 14, 2021

Hi @jroakes. Thanks for the heads-up. As for the issues you mentioned:

  1. This issue should indeed be corrected; I was not aware that the second example is a valid robots.txt configuration. Luckily, in the context of this year's Security chapter, we can still use the data, since we're only interested in which endpoints are covered.

  2. This issue was also discussed in the original PR: Security 2021 custom metrics #219. It was decided to do the filtering at the querying stage, so that the data collected by the custom metric stays consistent over the years, because, as you mentioned, this kind of filtering is difficult to get perfect.

@jroakes (Contributor) commented on Oct 14, 2021

Thanks @GJFR, I think you make a really good point for pt. 2 that I had not considered. FWIW, this is what I had tested, in case it is helpful. It solves pt. 1, and my solution for pt. 2 was to 1) include oauth in the keywords, and 2) only count keyword matches preceded by a non-word character.

This also includes a regex that should match directive lines with varying whitespace, such as:

disallow : useragent
disallow:     useragent
    disallow:  useragent

But it doesn't clean lines first to handle commented values, as in the example below (I did this in robots_txt.js):

disallow: /path/  #This is a secure route
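A simple pre-cleaning step could handle that (an assumption, not part of the POC below):

// Strip trailing comments before matching directives.
const cleaned = line.split('#')[0].trimEnd();

The POC as tested: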
// POC: attribute keyword-matched disallow rules to every user-agent in a group.
let text = $('html').textContent;
let data = {'matched_disallows': {}};
let KEYWORDS = [
  'login',
  'log-in',
  'signin',
  'sign-in',
  'admin',
  'auth',
  'oauth',
  'sso',
  'account',
  'secure'
];

let currUserAgents = [];   // all user-agents in the group currently being parsed
let last_rule = null;
// Matches 'user-agent' or 'disallow' followed by optional whitespace and ':'.
const rule_rx = new RegExp(`(user-agent|disallow)(?=\\s*:)`, 'gi');
const kws = KEYWORDS.join('|');
// A keyword must be preceded by at least one non-word character to count.
const kw_rx = new RegExp(`[^\\w]+(${kws})`, 'gi');

for (let line of text.split('\n')) {

  let matched = line.toLowerCase().match(rule_rx);

  if (matched) {

    let rule = matched[0].trim();
    let value = line.slice(line.indexOf(':') + 1).trim();

    if (rule == 'user-agent') {

      // Consecutive user-agent lines accumulate into one group.
      if (last_rule == rule) {
        currUserAgents.push(value);
      } else {
        currUserAgents = [value];
      }

    } else if (rule == 'disallow') {

      if (value.match(kw_rx)) {

        // Credit the matched disallow to every agent in the current group.
        for (const currAgent of currUserAgents) {

          if (data['matched_disallows'][currAgent] === undefined) {
            data['matched_disallows'][currAgent] = [];
          }

          data['matched_disallows'][currAgent].push(value);
        }

      }
    }

    last_rule = rule;
  }

}

JSON.stringify(data);

Here is test output for LinkedIn:
https://gist.github.com/jroakes/bb5ac67ee4584f58f5a7714d62605da1

I am not that great with JS, so I'm sharing this as a POC example only; it can be improved.
